MAT-2000 - design, collection, and validation of a Mandarin 2000-speaker telephone speech database
Abstract
Mandarin speech data Across Taiwan (MAT) is a project initiated by members of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) to collect speech data through public telephone networks in Taiwan. In total, over 7000 Taiwanese individuals have provided speech data. The results were released to the research community in Taiwan as a series of MAT speech databases. Two of these, MAT-160 and MAT-400, were used for the first and second Assessment of Speech Recognition Technique in Taiwan. Preparation for the release of a larger database of over 2000 speakers, called MAT-2000, has now been completed. In this joint project conducted by ACLCLP and Philips Research East-Asia, considerable effort was spent on validating the database to ensure its quality. MAT-2000 consists of over 80 hours of recordings and contains about 640,000 Mandarin syllables in over 140,000 speech files. These speech files are grouped into five sub-databases for different application purposes.
Similar resources
Issues in Design and Collection of Large Telephone Speech Corpus for Slovenian Language
In this paper, different issues in the design, collection and evaluation of a large vocabulary telephone speech corpus of the Slovenian language are discussed. The database is composed of three text corpora containing 1530 different sentences. It contains read speech of 82 speakers, where each speaker read on average more than 200 sentences and 21 speakers also read a text passage of 90 sentences. T...
HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus
The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS) from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data include 1206 ten-minute natural Mandarin conversations between either stran...
POLYCOST: A telephone-speech database for speaker recognition
This article presents an overview of the POLYCOST database dedicated to speaker recognition applications over the telephone network. The main characteristics of this database are: large mixed speech corpus size (> 100 speakers), English spoken by foreigners, mainly digits with some free speech, collected through international telephone lines, and more than eight sessions per speaker.
MAT - A Project to Collect Mandarin Speech Data Through Telephone Networks in Taiwan
A cooperative project, called Polyphone, was initiated by the Coordinating Committee on Speech Databases and Speech I/O Systems Assessment (COCOSDA) in 1992. Accordingly, a project to collect Mandarin speech data across Taiwan (MAT) was conducted by a group of researchers from several universities and research organizations in Taiwan. The purpose was to generate a speech corpus for the developm...
Development of the Estonian SpeechDat-like database
A new database project was launched in Estonia last year. It aims at the collection of telephone speech from a large number of speakers for speech and speaker recognition purposes. Up to 2000 speakers are expected to participate in recordings. SpeechDat databases, especially Finnish SpeechDat, have been chosen as a prototype for the Estonian database. It means that principles of corpus design...